Scripts & Frames
- Oceania > Australia (0.14)
- Asia > Middle East > Jordan (0.04)
- Health & Medicine (0.52)
- Leisure & Entertainment > Games (0.47)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.68)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Scripts & Frames (0.66)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
Paper: Generalization of Reinforcement Learners with Working and Episodic Memory
We thank the reviewers for their thoughtful and constructive feedback on our manuscript. This should help both contextualize each task's difficulty and illustrate what it involves. Reviewer 3 noted the Section 2 task descriptions could be better presented. We have reformatted it so that "the order We also changed our description of IMP ALA to match Reviewer 5's suggestion. Regarding the task suite, Reviewer 4 raised a thoughtful consideration on whether "most of the findings translate when Some 3D tasks in the suite already have '2D-like' semi-counterparts that do not require navigation, '2D-like' because everything is fully observable and the agent has a first-person point of view from a fixed point, without Spot the Difference level, was overall harder than Change Detection for our ablation models.
- North America > United States > Iowa > Johnson County > Iowa City (0.14)
- North America > United States > California > San Diego County > San Diego (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- Health & Medicine > Consumer Health (0.41)
- Education > Educational Setting > Continuing Education (0.41)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Scripts & Frames (0.41)
GENESIS: A Generative Model of Episodic-Semantic Interaction
D'Alessandro, Marco, D'Amato, Leo, Elkano, Mikel, Uriz, Mikel, Pezzulo, Giovanni
A central challenge in cognitive neuroscience is to explain how semantic and episodic memory, two major forms of declarative memory, typically associated with cortical and hippocampal processing, interact to support learning, recall, and imagination. Despite significant advances, we still lack a unified computational framework that jointly accounts for core empirical phenomena across both semantic and episodic processing domains. Here, we introduce the Generative Episodic-Semantic Integration System (GENESIS), a computational model that formalizes memory as the interaction between two limited-capacity generative systems: a Cortical-VAE, supporting semantic learning and generalization, and a Hippocampal-VAE, supporting episodic encoding and retrieval within a retrieval-augmented generation (RAG) architecture. GENESIS reproduces hallmark behavioral findings, including generalization in semantic memory, recognition, serial recall effects and gist-based distortions in episodic memory, and constructive episodic simulation, while capturing their dynamic interactions. The model elucidates how capacity constraints shape the fidelity and memorability of experiences, how semantic processing introduces systematic distortions in episodic recall, and how episodic replay can recombine previous experiences. Together, these results provide a principled account of memory as an active, constructive, and resource-bounded process. GENESIS thus advances a unified theoretical framework that bridges semantic and episodic memory, offering new insights into the generative foundations of human cognition.
- North America > United States > New York (0.04)
- Europe > Italy > Lazio > Rome (0.04)
- Asia > Middle East > Jordan (0.04)
- Health & Medicine > Therapeutic Area > Neurology (1.00)
- Health & Medicine > Consumer Health (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Scripts & Frames (0.79)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
FLEX: A Largescale Multimodal, Multiview Dataset for Learning Structured Representations for Fitness Action Quality Assessment
Yin, Hao, Gu, Lijun, Parmar, Paritosh, Xu, Lin, Guo, Tianxiao, Fu, Weiwei, Zhang, Yang, Zheng, Tianyou
Action Quality Assessment (AQA) -- the task of quantifying how well an action is performed -- has great potential for detecting errors in gym weight training, where accurate feedback is critical to prevent injuries and maximize gains. Existing AQA datasets, however, are limited to single-view competitive sports and RGB video, lacking multimodal signals and professional assessment of fitness actions. We introduce FLEX, the first large-scale, multimodal, multiview dataset for fitness AQA that incorporates surface electromyography (sEMG). FLEX contains over 7,500 multiview recordings of 20 weight-loaded exercises performed by 38 subjects of diverse skill levels, with synchronized RGB video, 3D pose, sEMG, and physiological signals. Expert annotations are organized into a Fitness Knowledge Graph (FKG) linking actions, key steps, error types, and feedback, supporting a compositional scoring function for interpretable quality assessment. FLEX enables multimodal fusion, cross-modal prediction -- including the novel Video$\rightarrow$EMG task -- and biomechanically oriented representation learning. Building on the FKG, we further introduce FLEX-VideoQA, a structured question-answering benchmark with hierarchical queries that drive cross-modal reasoning in vision-language models. Baseline experiments demonstrate that multimodal inputs, multiview video, and fine-grained annotations significantly enhance AQA performance. FLEX thus advances AQA toward richer multimodal settings and provides a foundation for AI-powered fitness assessment and coaching. Dataset and code are available at \href{https://github.com/HaoYin116/FLEX}{https://github.com/HaoYin116/FLEX}. Link to Project \href{https://haoyin116.github.io/FLEX_Dataset}{page}.
- Research Report > New Finding (0.67)
- Research Report > Experimental Study (0.46)
- Health & Medicine > Consumer Health (1.00)
- Leisure & Entertainment > Sports (0.88)
- Health & Medicine > Diagnostic Medicine > Imaging (0.48)
- Health & Medicine > Therapeutic Area > Neurology (0.34)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Scripts & Frames (0.40)
- Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.34)
One Life to Learn: Inferring Symbolic World Models for Stochastic Environments from Unguided Exploration
Khan, Zaid, Prasad, Archiki, Stengel-Eskin, Elias, Cho, Jaemin, Bansal, Mohit
Symbolic world modeling requires inferring and representing an environment's transitional dynamics as an executable program. Prior work has focused on largely deterministic environments with abundant interaction data, simple mechanics, and human guidance. We address a more realistic and challenging setting, learning in a complex, stochastic environment where the agent has only "one life" to explore a hostile environment without human guidance. We introduce OneLife, a framework that models world dynamics through conditionally-activated programmatic laws within a probabilistic programming framework. Each law operates through a precondition-effect structure, activating in relevant world states. This creates a dynamic computation graph that routes inference and optimization only through relevant laws, avoiding scaling challenges when all laws contribute to predictions about a complex, hierarchical state, and enabling the learning of stochastic dynamics even with sparse rule activation. To evaluate our approach under these demanding constraints, we introduce a new evaluation protocol that measures (a) state ranking, the ability to distinguish plausible future states from implausible ones, and (b) state fidelity, the ability to generate future states that closely resemble reality. We develop and evaluate our framework on Crafter-OO, our reimplementation of the Crafter environment that exposes a structured, object-oriented symbolic state and a pure transition function that operates on that state alone. OneLife can successfully learn key environment dynamics from minimal, unguided interaction, outperforming a strong baseline on 16 out of 23 scenarios tested. We also test OneLife's planning ability, with simulated rollouts successfully identifying superior strategies. Our work establishes a foundation for autonomously constructing programmatic world models of unknown, complex environments.
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- Asia > Middle East > Jordan (0.04)
- Research Report (0.64)
- Workflow (0.46)
- Leisure & Entertainment > Games > Computer Games (0.93)
- Law (0.68)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.34)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Scripts & Frames (0.34)
BitMar: Low-Bit Multimodal Fusion with Episodic Memory for Edge Devices
Aman, Euhid, Carlin, Esteban, Pao, Hsing-Kuo, Beltrame, Giovanni, Sari, Ghaluh Indah Permata, Chen, Yie-Tarng
Cross-attention transformers and other multimodal vision-language models excel at grounding and generation; however, their extensive, full-precision backbones make it challenging to deploy them on edge devices. Memory-augmented architectures enhance the utilization of past context; however, most works rarely pair them with aggressive edge-oriented quantization. We introduce BitMar, a quantized multimodal transformer that proposes an external human-like episodic memory for effective image-text generation on hardware with limited resources. BitMar utilizes 1.58-bit encoders, one for text (BitNet-style) and one for vision (DiNOv2-based), to create compact embeddings that are combined and used to query a fixed-size key-value episodic memory. During vector retrieval, the BitNet decoder applies per-layer conditioning, which increases the contextual relevance of generated content. The decoder also employs attention sinks with a sliding-window mechanism to process long or streaming inputs under tight memory budgets. The combination of per-layer conditioning and sliding-window attention achieves a strong quality-speed trade-off, delivering competitive captioning and multimodal understanding at low latency with a small model footprint. These characteristics make BitMar well-suited for edge deployment.
- Asia > Taiwan (0.41)
- North America > United States > Mississippi (0.04)
- Oceania > Australia > Victoria > Melbourne (0.04)
- (4 more...)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Scripts & Frames (0.86)
A Biologically Interpretable Cognitive Architecture for Online Structuring of Episodic Memories into Cognitive Maps
Dzhivelikian, E. A., Panov, A. I.
Cognitive maps provide a powerful framework for understanding spatial and abstract reasoning in biological and artificial agents. While recent computational models link cognitive maps to hippocampal-entorhinal mechanisms, they often rely on global optimization rules (e.g., backpropagation) that lack biological plausibility. In this work, we propose a novel cognitive architecture for structuring episodic memories into cognitive maps using local, Hebbian-like learning rules, compatible with neural substrate constraints. Our model integrates the Successor Features framework with episodic memories, enabling incremental, online learning through agent-environment interaction. We demonstrate its efficacy in a partially observable grid-world, where the architecture autonomously organizes memories into structured representations without centralized optimization. This work bridges computational neuroscience and AI, offering a biologically grounded approach to cognitive map formation in artificial adaptive agents.
- Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.04)
- Asia > Russia (0.04)
- North America > United States > District of Columbia > Washington (0.04)
- Information Technology > Artificial Intelligence > Cognitive Science (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Scripts & Frames (0.79)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.48)
Paper: Generalization of Reinforcement Learners with Working and Episodic Memory
We thank the reviewers for their thoughtful and constructive feedback on our manuscript. This should help both contextualize each task's difficulty and illustrate what it involves. Reviewer 3 noted the Section 2 task descriptions could be better presented. We have reformatted it so that "the order We also changed our description of IMP ALA to match Reviewer 5's suggestion. Regarding the task suite, Reviewer 4 raised a thoughtful consideration on whether "most of the findings translate when Some 3D tasks in the suite already have '2D-like' semi-counterparts that do not require navigation, '2D-like' because everything is fully observable and the agent has a first-person point of view from a fixed point, without Spot the Difference level, was overall harder than Change Detection for our ablation models.
STRIVE: Structured Representation Integrating VLM Reasoning for Efficient Object Navigation
Zhu, Haokun, Li, Zongtai, Liu, Zhixuan, Wang, Wenshan, Zhang, Ji, Francis, Jonathan, Oh, Jean
Figure 1: STRIVE can conduct zero-shot object navigation in diverse and complex real-world environments by leveraging our novel multi-layer representation and efficient two-stage navigation policy. Abstract-- Vision-Language Models (VLMs) have been increasingly integrated into object navigation tasks for their rich prior knowledge and strong reasoning abilities. However, applying VLMs to navigation presents two key challenges: effectively parsing and structuring complex environment information and determining when and how to query VLMs. T o address these challenges, we propose a novel framework that incrementally constructs a multi-layer environment representation consisting of viewpoints, object nodes, and room nodes during navigation. Viewpoints and object nodes facilitate intra-room exploration and accurate target localization, while room nodes support efficient inter-room planning. Building on this structured representation, we propose a novel two-stage navigation policy, integrating high-level planning guided by VLM reasoning with low-level VLM-assisted exploration to efficiently and reliably locate a goal object. Object navigation is a fundamental task in robotics, where an agent must locate an instance of a given object category in unknown environments. This task is particularly challenging, as it requires the agent to understand complex visual information, reason about spatial relationships, and make decisions based on both current and past observations. Advances in Vision-Language Models (VLMs) [1], [2], [3] have demonstrated strong capabilities in contextual visual understanding and common-sense reasoning. However, existing approaches often face two significant challenges: First, the input to VLMs typically lacks a structured representation of the environment and is often restricted to local observations.
- Information Technology > Artificial Intelligence > Representation & Reasoning > Scripts & Frames (0.81)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Commonsense Reasoning (0.68)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.67)